← Back to Contents
Note: This page's design and presentation have been enhanced using Claude (Anthropic's AI assistant) to improve visual quality and educational experience.
Week 2 • Sub-Lesson 1

🏗️ LLM Architecture Deep Dive

Understanding the technical foundations of modern large language models

What We'll Cover

In Week 1, we introduced transformers as "attention machines" and briefly explored how they differ from earlier architectures. This session goes deeper: we'll dive back into the transformer architecture that powers models like GPT, Claude, and Gemini, understand how attention mechanisms actually work, and explore cutting-edge innovations like Mixture of Experts that make modern models more efficient.

By the end of this session you should have a greater appreciation of how LLMs work—and why architectural choices matter for research applications.

Some of this may feel uncomfortably mathematical. If you find yourself getting bogged down, aim to take away the overall gist of what is going on rather than every detail.

🧩 Transformer Architecture Fundamentals

Let's revisit the transformer architecture with more technical depth. While Week 1 gave us the intuition, here we'll understand the actual mechanisms.

This video is in addition to the videos from last week and is also a little technical, but it gives a slightly different take on the 3Blue1Brown explanations.

📹 Take a look at the 3Blue1Brown videos before this one, but here is a lecture series on the transformer architecture.

The Core Components

  • Token embeddings: Converting discrete tokens into continuous vector representations
  • Positional encoding: Injecting information about token position in the sequence
  • Attention layers: The heart of the model—learning which tokens matter
  • Feed-forward networks: Processing attended information within each layer
  • Layer normalization: Stabilizing training across deep networks
  • Residual connections: Allowing gradients to flow through many layers

Decoder-Only Architecture

Modern LLMs (GPT, Claude, LLaMA) use decoder-only architectures rather than the original encoder-decoder design.

  • Autoregressive generation: Predict one token at a time, left to right
  • Causal masking: Each token can only attend to previous tokens
  • Simpler architecture: No cross-attention needed
  • Unified pre-training: Single objective (next-token prediction) for all training
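Causal masking, from the list above, is easy to visualise: it is just a lower-triangular boolean matrix over positions. A minimal sketch:

```python
import numpy as np

def causal_mask(seq_len):
    """Lower-triangular mask: position i may attend to positions 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

mask = causal_mask(4)
# Row i is the attention pattern for token i: it can see itself and
# everything before it, but nothing after it.
print(mask.astype(int))
```

During training, the entries above the diagonal are set to -infinity in the attention scores, so the softmax assigns them zero weight and each token's prediction never "sees the future".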

💡 Why decoder-only?

Decoder-only models proved more scalable and effective for general-purpose language understanding and generation. The encoder-decoder design is still used for specific tasks like translation.

📐 The Forward Pass: Input to Output

Here's what happens when you send text to an LLM:

  1. Tokenization: Text → token IDs (e.g., "The cat" → [464, 3797])
  2. Embedding lookup: Each token ID → dense vector (e.g., 4096 dimensions)
  3. Positional encoding: Add position information to each token vector
  4. Transformer layers (repeated N times):
    • Multi-head self-attention: tokens attend to previous context
    • Feed-forward network: process attended representations
    • Residual connections + layer norm after each sub-layer
  5. Final layer norm: Normalize output representations
  6. Output projection: Map to vocabulary size (e.g., 50,000 tokens)
  7. Sampling: Choose next token based on probability distribution
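The seven steps above can be sketched end-to-end in a few lines of numpy. This is a toy illustration with random, hypothetical weights and a crude causal "context mix" standing in for real multi-head attention; the point is the data flow, not the numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes -- real models use vocab ~50k-128k, d_model 4096+, dozens of layers
vocab_size, d_model, d_ff, n_layers = 1000, 64, 256, 2

# Hypothetical weights, randomly initialised purely for illustration
W_embed = rng.normal(size=(vocab_size, d_model)) * 0.02
W_ffn1 = rng.normal(size=(d_model, d_ff)) * 0.02
W_ffn2 = rng.normal(size=(d_ff, d_model)) * 0.02
W_out = rng.normal(size=(d_model, vocab_size)) * 0.02

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_layer(x):
    # A real layer runs causally-masked multi-head attention here; this sketch
    # substitutes a running mean over earlier positions (which is at least
    # causal) to keep the data flow visible without the attention machinery.
    mixed = np.cumsum(x, axis=0) / np.arange(1, len(x) + 1)[:, None]
    x = layer_norm(x + mixed)                    # residual + norm (sub-layer 1)
    ffn = np.maximum(x @ W_ffn1, 0) @ W_ffn2     # two-layer feed-forward network
    return layer_norm(x + ffn)                   # residual + norm (sub-layer 2)

token_ids = np.array([464, 379, 318])                  # step 1: token IDs (given)
x = W_embed[token_ids]                                 # step 2: embedding lookup
x = x + np.sin(np.arange(len(token_ids)))[:, None] * 0.02  # step 3: toy positions
for _ in range(n_layers):                              # step 4: N transformer layers
    x = transformer_layer(x)
x = layer_norm(x)                                      # step 5: final layer norm
logits = x @ W_out                                     # step 6: project to vocab
z = logits[-1] - logits[-1].max()                      # stable softmax
probs = np.exp(z) / np.exp(z).sum()
next_token = int(np.argmax(probs))                     # step 7: greedy "sampling"
```

Real models repeat step 4 many more times (32-100+ layers) and sample from `probs` with temperature rather than always taking the argmax.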

Here is the original paper itself, in case you are feeling particularly brave: "Attention Is All You Need" (Vaswani et al., 2017) 

👁️ Attention Mechanisms in Detail

Attention is the fundamental innovation that makes transformers work. Let's understand the different types and how they've evolved.

🔑 The Attention Intuition

When you read "The animal didn't cross the street because it was too tired," your brain automatically knows "it" refers to "the animal," not "the street." You attend to relevant context.

Self-attention mechanisms let transformers do the same thing: for each token, learn which other tokens in the context are relevant, and weight them accordingly when building representations.

Self-Attention

The core mechanism: each token computes attention scores with every other token in the sequence.

  • Query, Key, Value: Each token produces three vectors through learned projections
  • Attention scores: Dot product of Query with all Keys (how relevant is each token?)
  • Softmax normalization: Convert scores to probability distribution
  • Weighted sum: Combine Values using attention weights

🔍 Scaled Dot-Product

Attention scores are scaled by √d_k (the square root of the key dimension). Without this scaling, dot products grow with dimension, pushing the softmax into saturated regions where gradients become extremely small.
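Putting the four steps together with the scaling just described, here is a minimal numpy sketch of scaled dot-product attention (single head, with causal masking):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # how relevant is each key to each query?
    if causal:
        future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)    # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V, weights              # weighted sum of Values, plus weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 16
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, weights = scaled_dot_product_attention(Q, K, V)
# Each row of `weights` sums to 1 (a probability distribution over the context),
# and row i carries zero weight on positions after i (causal masking).
```

In a real model, Q, K, and V come from learned projections of the token representations rather than being random.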

Multi-Head Attention

Instead of one attention mechanism, use many in parallel—each learns different patterns.

  • Multiple heads: 32-96 attention heads in modern LLMs
  • Specialized patterns: Each head might learn syntax, semantics, or other relationships
  • Concatenation: Outputs from all heads combined and projected
  • Richer representations: Capture multiple types of dependencies simultaneously
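The split-attend-concatenate pattern described above amounts to a couple of reshapes. A sketch with toy sizes and hypothetical random projection weights (causal masking omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4
d_head = d_model // n_heads   # each head works in a smaller subspace

x = rng.normal(size=(seq_len, d_model))
# Hypothetical learned projections: one big matrix each, then split per head
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

def split_heads(t):
    # (seq, d_model) -> (n_heads, seq, d_head)
    return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # per-head attention scores
scores = scores - scores.max(-1, keepdims=True)
w = np.exp(scores); w = w / w.sum(-1, keepdims=True)  # per-head softmax
heads = w @ V                                         # (n_heads, seq, d_head)
# Concatenate the heads back together and apply the output projection
out = heads.transpose(1, 0, 2).reshape(seq_len, d_model) @ W_o
```

Because each head attends in its own d_head-dimensional subspace, the heads are free to specialise on different relationships, which is exactly the point of the mechanism.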

Cross-Attention

Used in encoder-decoder models (and multimodal systems): attend to a different sequence.

  • Query from decoder: "What am I trying to generate?"
  • Keys/Values from encoder: "What input information is available?"
  • Translation example: Decoder attends to source language while generating target
  • Vision-language models: Text decoder attends to image features

📹 I don't think that you can do much better than 3Blue1Brown to understand attention

⚡ Modern Attention Variants

Standard multi-head attention is computationally expensive. Modern models use optimized variants:

Variant | Key Idea | Benefit | Used In
Multi-Head Attention (MHA) | Each head has its own Q, K, V projections | Rich representations | Original transformers, GPT-3
Multi-Query Attention (MQA) | Share K, V across heads; unique Q per head | Faster inference, less memory | PaLM, some Llama variants
Grouped-Query Attention (GQA) | Multiple heads share K, V in groups | Balance between MHA and MQA | Llama 2, Mistral, GPT-4 (rumored)

Research implication: GQA has become the dominant choice for new models—it provides most of MHA's quality with much better inference efficiency.
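The inference saving comes mostly from the KV cache, which stores one K and V vector per token, per layer, per KV head. A back-of-envelope comparison using Llama-2-70B-style shapes (80 layers, 64 query heads, head dimension 128, fp16 cache; the 8 KV groups match Llama 2's published config):

```python
# KV-cache size = 2 (K and V) * layers * kv_heads * d_head * seq_len * bytes
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per=2):
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per

seq = 4096
mha = kv_cache_bytes(80, 64, 128, seq)   # MHA: every query head has its own K/V
gqa = kv_cache_bytes(80, 8, 128, seq)    # GQA: 8 KV groups shared by 64 query heads
mqa = kv_cache_bytes(80, 1, 128, seq)    # MQA: a single K/V shared by all heads
print(f"MHA: {mha/1e9:.1f} GB, GQA: {gqa/1e9:.1f} GB, MQA: {mqa/1e9:.2f} GB")
# -> MHA: 10.7 GB, GQA: 1.3 GB, MQA: 0.17 GB  (per 4096-token sequence)
```

An 8x reduction in cache memory per sequence is why GQA lets servers batch many more concurrent requests, at almost no quality cost.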

📍 Positional Encoding: Teaching Position

Transformers process all tokens in parallel (unlike RNNs which are sequential). But word order matters! Positional encoding solves this problem.

Absolute Positional Encoding

Original approach: add a position-specific vector to each token embedding.

  • Sinusoidal encoding: Original transformer used sin/cos functions of different frequencies
  • Learned positions: Some models learn position embeddings during training (like GPT)
  • Fixed context: Limited to maximum sequence length seen during training
  • Issue: Doesn't extrapolate well to longer sequences

Relative Positional Encoding

Modern approach: encode the distance between tokens rather than absolute position.

  • RoPE (Rotary Position Embedding): Rotate Q and K vectors based on position—used in LLaMA, Mistral, GPT-NeoX
  • ALiBi (Attention with Linear Biases): Add bias to attention scores based on distance—used in BLOOM, MPT
  • Length extrapolation: Can handle sequences longer than training context
  • Better generalization: Understands "distance" concept rather than memorizing positions

💡 Why RoPE Dominates

RoPE (Rotary Position Embedding) has become the standard for new LLMs because it:

  • Encodes relative positions naturally through rotation in complex space
  • Allows models to extrapolate to longer contexts than seen in training
  • Maintains computational efficiency (applied during Q/K computation)
  • Empirically outperforms alternatives on long-context tasks
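The "rotation in complex space" idea is concrete enough to verify in a few lines: RoPE rotates each consecutive pair of dimensions by an angle proportional to the token's position, with a different frequency per pair. Because rotations compose, the dot product between a rotated query and key depends only on the *difference* of their positions. A minimal sketch:

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of vector x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-2 * np.arange(d // 2) / d)  # one frequency per dimension pair
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]                     # split into (even, odd) pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin               # standard 2-D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)
# Key property: the attention score depends only on the *relative* offset.
s1 = rope(q, 5) @ rope(k, 3)       # positions (5, 3): offset 2
s2 = rope(q, 105) @ rope(k, 103)   # positions (105, 103): same offset 2
assert np.isclose(s1, s2)          # identical score -> relative position encoding
```

Rotations also preserve vector norms, so RoPE changes *where* a query points without changing its magnitude, which is part of why it is so well behaved.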

📄 Again, this is a little mathsy, but see if you can work through this guide, perhaps with ChatGPT's help, to understand positional encodings

Positional Embeddings in Transformers: A Math Guide to RoPE & ALiBi

🔀 Modern Architectural Innovations

The frontier of LLM architecture isn't just about making models bigger—it's about making them smarter. Here are the key innovations driving 2024-2026 models.

💡 The Efficiency Revolution

Modern LLM research focuses on parameter efficiency: getting better performance without proportionally increasing compute costs. The key insight: not every parameter needs to activate for every input.

This shift—from "bigger is better" to "smarter is better"—is driven by Mixture of Experts architectures, attention optimizations, and clever training techniques.

🎯 Mixture of Experts (MoE)

MoE is perhaps the most important architectural innovation in modern LLMs. Instead of using all parameters for every token, route each token to a subset of specialized "expert" networks.

How MoE Works:
  1. Expert networks: Instead of one feed-forward network per layer, have 8-64 expert FFNs
  2. Router network: Small learned network decides which experts process each token
  3. Sparse activation: Only top-K experts (typically 2-8) activate per token
  4. Load balancing: Ensure tokens distribute roughly evenly across experts
  5. Combination: Outputs from active experts weighted and summed
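The five steps above can be sketched for a single token in a few lines. Everything here is a toy, with hypothetical random weights; real MoE layers also add load-balancing losses and run experts in parallel across devices.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 32, 8, 2

# Hypothetical weights: a tiny router plus one two-layer FFN per expert (step 1)
W_router = rng.normal(size=(d_model, n_experts)) * 0.1
experts = [(rng.normal(size=(d_model, 64)) * 0.1,
            rng.normal(size=(64, d_model)) * 0.1) for _ in range(n_experts)]

def moe_layer(x):
    """Route token vector x to its top-k experts and blend their outputs."""
    logits = x @ W_router                         # step 2: router scores each expert
    chosen = np.argsort(logits)[-top_k:]          # step 3: only top-k experts activate
    gate = np.exp(logits[chosen])
    gate = gate / gate.sum()                      # normalised mixture weights
    out = np.zeros_like(x)
    for g, (W1, W2) in zip(gate, (experts[i] for i in chosen)):
        out += g * (np.maximum(x @ W1, 0) @ W2)   # step 5: weighted sum of experts
    return out, chosen

token = rng.normal(size=d_model)
y, used = moe_layer(token)
# Only top_k of the n_experts FFNs ran for this token: the layer holds 8 experts'
# worth of parameters but pays compute for just 2.
```

Step 4 (load balancing) is the part this sketch omits: in training, an auxiliary loss pushes the router to spread tokens across experts so none sit idle.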

MoE Benefits

  • Massive parameter count: 100B+ total parameters with only 10-20B active per token
  • Inference efficiency: Computational cost based on active parameters, not total
  • Specialization: Different experts can learn different domains/patterns
  • Scaling: Add more experts without proportional compute increase

MoE Challenges

  • Training complexity: Load balancing is tricky—some experts might be underutilized
  • Memory requirements: All experts must fit in GPU memory even if only few are active
  • Communication overhead: Routing adds latency in distributed systems
  • Instability: Careful tuning needed to prevent expert collapse

📊 MoE in Production Models

Mixtral 8x7B: 8 experts, 2 active per token → 47B total params, 13B active → performs like 47B dense model at cost of 13B

DeepSeek-V3: 256 experts, 8 active per token → 671B total params, 37B active → competitive with GPT-4 at fraction of inference cost

GPT-4 (rumored): Speculated to use MoE with 8-16 experts, explaining its size vs. inference speed

📹 Mixture of Experts explanation

📄 DeepSeek-V3 Github repo

Here is an open-weights Mixture of Experts model that you can, in theory, download and run (though I wouldn't do this on your laptop).

📏 Understanding Model Scale

When we say "GPT-4 has 1.76 trillion parameters" or "LLaMA 3 70B," what do these numbers actually mean? And is bigger always better?

Parameter Count Breakdown

Parameters are the learned weights in the neural network. They're distributed across:

  • Embedding layers: Token + position embeddings (vocab_size × embedding_dim)
  • Attention layers: Q, K, V, output projections for each head, each layer
  • Feed-forward networks: Two linear layers per transformer layer (typically 4× hidden size)
  • Layer norms: Small contribution (scale + shift per layer)

🔢 Quick Math

A 7B-parameter model such as LLaMA has 32 layers, a hidden size of 4096, and 32 attention heads. Each layer contributes roughly 4 × 4096² attention weights plus about 3 × 4096 × 11008 feed-forward weights; multiplied across 32 layers and adding embeddings, that comes to ≈ 7 billion parameters.
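You can check a count like this yourself. A rough tally for a LLaMA-7B-style model, using its published configuration (32 layers, hidden size 4096, FFN size 11008, vocabulary 32000, SwiGLU feed-forward):

```python
# Rough parameter count for a LLaMA-7B-style transformer
d, layers, d_ffn, vocab = 4096, 32, 11008, 32000

attn = 4 * d * d             # Q, K, V and output projections
ffn = 3 * d * d_ffn          # SwiGLU uses three matrices (gate, up, down)
per_layer = attn + ffn
embeddings = 2 * vocab * d   # input embedding + (untied) output head

total = layers * per_layer + embeddings
print(f"{total/1e9:.2f}B parameters")   # -> 6.74B, matching the "7B" label
```

Layer norms are omitted here; at two small vectors per layer they contribute well under a million parameters.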

Parameters vs. Active Parameters

For MoE models, these are different concepts:

  • Total parameters: All weights in all experts (what's reported as "model size")
  • Active parameters: Weights used for any single forward pass
  • Example: Mixtral 8x7B has 47B total params but only 13B active per token
  • Inference cost: Determined by active parameters, not total
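The Mixtral numbers above can be reproduced with the same kind of tally, using Mixtral-8x7B's published configuration (32 layers, hidden 4096, FFN 14336, 8 experts with top-2 routing, 32 query heads sharing 8 KV heads, vocabulary 32000):

```python
# Total vs. active parameters for a Mixtral-8x7B-style MoE model
d, layers, d_ffn, vocab = 4096, 32, 14336, 32000
n_experts, active_experts, n_heads, n_kv_heads, d_head = 8, 2, 32, 8, 128

attn = d * (2 * n_heads * d_head + 2 * n_kv_heads * d_head)  # Q,O full; K,V grouped
expert = 3 * d * d_ffn          # one SwiGLU FFN per expert
embeddings = 2 * vocab * d

total  = layers * (attn + n_experts * expert) + embeddings
active = layers * (attn + active_experts * expert) + embeddings
print(f"total = {total/1e9:.0f}B, active = {active/1e9:.0f}B")   # -> 47B and 13B
```

Note that the attention weights and embeddings are always active; only the expert FFNs are sparsely used, which is why active parameters are more than 2/8 of the total.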

Dense vs. Sparse Architectures

  • Dense models: All parameters active for every input (GPT-3, Claude, LLaMA 2)
  • Sparse models (MoE): Subset of parameters active per input (Mixtral, DeepSeek-V3, Switch Transformer)
  • Trade-off: Sparse models offer better parameter efficiency but increased training complexity
  • Trend: Major labs moving toward sparse architectures for flagship models

🎯 When Bigger ≠ Better

The relationship between model size and performance is nuanced:

Scenario | Smaller Model Wins | Larger Model Wins
Latency-sensitive applications | ✓ Faster inference, lower latency | ✗ Slower, requires more compute
Resource-constrained deployment | ✓ Runs on smaller GPUs, edge devices | ✗ Needs high-end infrastructure
Narrow domain tasks | ✓ Can be fine-tuned effectively | Diminishing returns
Complex reasoning | Limited capability | ✓ Better at multi-step problems
Rare/specialized knowledge | Likely to hallucinate | ✓ More knowledge encoded in parameters
Few-shot learning | Requires more examples | ✓ Better in-context learning

Research principle: Match model size to your task. A well-trained 7B model often outperforms a poorly-prompted 70B model. And for many research tasks (data analysis, writing assistance, literature review), mid-size models are sufficient.

🔮 The Current Frontier (Feb 2026)

Dense models: Have been largely superseded by MoE models at the frontier. However, GPT-4o seems to have been a dense model with around 1T parameters

MoE models: Many 200B-700B total parameters with 30B-50B active (DeepSeek-V3, Mixtral variants). Claude 4.6 may be an MoE model with around 700B parameters. ChatGPT 5.2 may have over 10T parameters, with fewer active parameters.

Small models: 7B-13B models (LLaMA 3, Mistral 7B) remain popular for local deployment, fine-tuning, and research experimentation

Trend: Architectural efficiency improvements (MoE, GQA, better training) matter more than raw parameter count

📄 Scaling Laws for LLMs

Understanding the current state of LLM scaling and the future of AI research

📚 Summary & Key Takeaways

You now understand the technical architecture of modern LLMs:

  • Transformer fundamentals: Decoder-only architecture, layer structure, residual connections
  • Attention mechanisms: Self-attention, multi-head variants (MHA/MQA/GQA), how tokens learn relevance
  • Positional encoding: RoPE and ALiBi enable models to understand sequence order and extrapolate length
  • Mixture of Experts: Sparse activation allows massive models with efficient inference
  • Scale considerations: Bigger isn't always better—match model to task, consider active vs. total parameters

Next session (Week 2.2): We'll explore how these architectures are actually trained—the pre-training process, optimization techniques, and the computational resources required to create LLMs from scratch.